The simple linear model is a statistical model that assumes a linear relationship between the input/x/independent variable and the output/y/dependent variable. Most of the data you will work with will involve a model that has one numerical y variable. The messy part is how the characteristics of the x variables determine the type of model you have and the types of statistical tests you will make. If you have one categorical x variable that has two categories (levels), the test is known as a t-test.
Let’s take another look at the fictitious crop data. We will focus on one x variable from the data (e.g., density) and how it impacts the y variable (yield).
library(readxl)
library(here)
## here() starts at /Users/conorfair/Library/CloudStorage/OneDrive-UniversityofGeorgia/Active Projects/Data Analysis in R
crop <- read_excel(here("data", "crop.xlsx"), sheet = "crop")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(crop)
## Rows: 96
## Columns: 6
## $ location <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ block <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,…
## $ density <chr> "High", "High", "High", "Low", "Low", "Low", "High", "High"…
## $ fertilizer <dbl> 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2,…
## $ yield <dbl> 177.5500, 176.0443, 182.0796, 170.2287, 176.4793, 177.1042,…
## $ yield_ms <dbl> 177.5500, 176.0443, 182.0796, 170.2287, 176.4793, 177.1042,…
We need to make sure the data are loaded correctly - since a t-test is used with a categorical x variable, we are set with how the density variable is coded (chr).
We will now visualize the data to better understand what is going on - good to get in the habit. This also helps guide the language for our question: Is there a difference in yield between the high and low density treatments? We can formulate a null and alternative hypothesis for this question:
Null: there is no difference in the yield between the high and low density treatments.
Alternative: there is a difference in the yield between the high and low density treatments.
library(ggplot2)
ggplot(crop, aes(x=density, y = yield))+
geom_boxplot()+
theme_classic()
Now that we have the null and alternative hypotheses and a better understanding of the data we can move onto the actual statistical test. The code is relatively straight-forward.
t.test(yield ~ density, data = crop)
##
## Welch Two Sample t-test
##
## data: yield by density
## t = 4.4062, df = 86.618, p-value = 2.999e-05
## alternative hypothesis: true difference in means between group High and group Low is not equal to 0
## 95 percent confidence interval:
## 3.055142 8.077103
## sample estimates:
## mean in group High mean in group Low
## 179.3923 173.8262
We would interpret the results following this example language:
“The null hypothesis that there was no difference in the yield between the high and low density treatments was rejected (t = 4.4062, df = 86.618, p-value = 0.00002999).”
It is important to report the test statistic, degrees of freedom, and p-value together - each piece of information is meaningless without the others. This is only the statistical interpretation of the results. I would also encourage you to report the effect size in the context in which the data were collected. Such language could look like this:
“The difference in yield between the high and low density treatments was 5.5661 with a 95% confidence interval of 3.055142 to 8.077103.”
The inclusion of the 95% confidence interval provides useful context: it gives the range of values for the true effect (the difference between the high and low density treatments) that is consistent with the data, constructed by a procedure that captures the true effect 95% of the time.
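If you want to pull these numbers out programmatically rather than copying them from the console, the t.test() result is a list whose components can be extracted directly. A minimal sketch with simulated data (the toy data frame and its numbers are illustrative stand-ins, not the crop data):

```r
# Illustrative sketch: extract reportable pieces from a t.test() result.
# The toy data below are simulated stand-ins for the crop data.
set.seed(1)
toy <- data.frame(
  density = rep(c("High", "Low"), each = 20),
  yield   = c(rnorm(20, mean = 179, sd = 6), rnorm(20, mean = 174, sd = 6))
)
tt <- t.test(yield ~ density, data = toy)  # Welch two-sample t-test
tt$estimate   # group means
tt$statistic  # t value
tt$conf.int   # 95% confidence interval for the difference in means
tt$p.value    # p-value to report alongside t and df
```

The same extraction works on the real data, e.g. t.test(yield ~ density, data = crop)$conf.int.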
There are important variations of the t-test that you can review if you are interested: the t.test() function assumes unequal variances by default (Welch's t-test), and you can set var.equal = TRUE if the variance between levels of the categorical variable is equal (as determined by leveneTest() or other methods).
When you increase the number of levels in a categorical variable beyond two, the type of linear model you are fitting is known as an ANOVA. From the crop data set we have another categorical variable that has more than two levels - fertilizer - but it is not coded as a categorical variable (chr).
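As a sketch of the equal-variance variation, only one argument changes. It is demonstrated here on R's built-in sleep data so the example is self-contained; with the crop data the call would be t.test(yield ~ density, data = crop, var.equal = TRUE):

```r
# Welch (default, unequal variances) vs Student's (pooled variance) t-test,
# demonstrated on R's built-in `sleep` data set.
res_welch  <- t.test(extra ~ group, data = sleep)
res_pooled <- t.test(extra ~ group, data = sleep, var.equal = TRUE)
res_welch$method    # reports the Welch two-sample t-test
res_pooled$method   # reports the pooled two-sample t-test
res_welch$parameter  # fractional df from the Welch approximation
res_pooled$parameter # integer df (n1 + n2 - 2) from the pooled version
```

Note that the two versions report different degrees of freedom, which is one quick way to confirm which test was run.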
crop <- crop %>%
mutate(fertilizer = as.character(fertilizer))
Visualizing the data once again will help us understand the data and write our question and hypotheses.
Question: Is there a difference in yield among the different fertilizer treatments?
Null hypothesis: there is no difference in the yield among the different fertilizer treatments.
Alternative hypothesis: there is at least one difference among the different fertilizer treatments.
crop %>%
group_by(fertilizer) %>%
summarize(mean = mean(yield),
std.dev = sd(yield),
count = n(),
std.error = sd(yield) / sqrt(n()))
## # A tibble: 3 × 5
## fertilizer mean std.dev count std.error
## <chr> <dbl> <dbl> <int> <dbl>
## 1 1 175. 5.63 32 0.995
## 2 2 175. 7.69 32 1.36
## 3 3 180. 6.01 32 1.06
ggplot(crop, aes(x=fertilizer, y = yield))+
geom_boxplot()+
theme_classic()
These hypotheses tell us what specific results we are looking for from the statistical test. The function we use is aptly named lm() (linear model). Similar to the t-test, there are assumptions that need to be tested in order to place confidence in the results from the model. When these assumptions are violated, there is a chance that the estimated effects and/or standard errors can be biased, which can lead to incorrect claims in terms of both the biological and statistical interpretation.
The symbols below rank how serious each type of assumption violation is and how to respond:
(**): major concern that needs to be addressed - consider other modeling options (e.g., mixed effects models, quadratic terms, generalized additive models, etc.)
(*): test statistics and confidence intervals are less reliable - consider GLMs to resolve issues
(+): will produce biased estimates - consider sensitivity analysis
(-): rare to have “perfect” multicollinearity
(.): hard to control - best practice is to be transparent in presenting doubts/caveats about interpretations of your results
To assess these assumptions we use various visual and statistical tools. Once you have run a model, you can test whether the model residuals have met the assumptions. The plot() function will create multiple plots of residuals to illustrate the relationship between fitted values/quantiles and the residuals. The Residuals vs Fitted plot is used to check for a linear relationship between fitted values and residuals; the line through the cloud of points should be flat. The Q-Q Residuals plot assesses the distribution of the residuals, and they are said to follow a normal distribution if the points follow the dashed line. The Residuals vs Fitted and Scale-Location plots can both be used to assess the constant variance assumption; the spread of the points should be equal across the fitted values (each group) if the residuals are to be assumed homoscedastic. The Residuals vs Leverage plot is used to check for outliers, and points that fall outside of Cook's distance (usually marked on the plot) are identified as potential outliers. The performance package also has the check_model() function that produces similar plots, and the details for this function can be found in the help file ?check_model.
One major consideration with these visual tools is that they lose their usefulness when dealing with small sample sizes. If you are uncertain in your interpretation of the figures, there are related statistical tests that can help determine whether the assumptions are violated.
model <- lm(yield ~ fertilizer, data = crop)
#plot(model) # you have to hit enter in the console to cycle through each plot
par(mfrow = c(2,2)) # creates a 2x2 arrangement of plots
plot(model)
par(mfrow = c(1,1)) # returns to default settings of plot window
library(performance)
check_model(model, detrend = F)
These are some of the tests available when the visualization methods above are inconclusive.
# Test for normally distributed residuals
shapiro.test(residuals(model)) # test the residuals from the model - not the model or raw data
##
## Shapiro-Wilk normality test
##
## data: residuals(model)
## W = 0.78209, p-value = 1.298e-10
#Tests for homogeneity of variance
car::leveneTest(model)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 0.171 0.843
## 93
car::ncvTest(model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.849659, Df = 1, p = 0.35665
#Test for outliers
car::outlierTest(model)
## rstudent unadjusted p-value Bonferroni p
## 35 -5.078733 1.9777e-06 0.00018986
## 22 -4.187232 6.4611e-05 0.00620270
## 53 -3.709464 3.5511e-04 0.03409100
The general approach should be to interpret the residuals from the model for each assumption and attempt to resolve one (if any) violated assumption at a time, then re-fit the model and re-assess the assumptions. These are fictitious data, but we will review this process with real data so you get an idea of how to interpret these plots. We will move forward with the analysis with this example data for now.
There are multiple pieces of statistical information that you can
gather from the model output. The summary() function will
produce output that lists the model formula, descriptive statistics of
the residuals, table of model coefficients, and other statistics about
the overall model fit. The anova() function will produce
output that reports the F statistic and other results from the ANOVA
table. There is another related function Anova() from the
car package that will allow you to extract different output
based on more complex models (interactions or unequal replication
experimental designs). For this simple model, both functions produce the
same output.
summary(model)
##
## Call:
## lm(formula = yield ~ fertilizer, data = crop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.887 -1.966 1.110 2.252 17.832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 175.4445 1.1497 152.599 <2e-16 ***
## fertilizer2 -0.5738 1.6259 -0.353 0.7249
## fertilizer3 4.0679 1.6259 2.502 0.0141 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.504 on 93 degrees of freedom
## Multiple R-squared: 0.09435, Adjusted R-squared: 0.07488
## F-statistic: 4.845 on 2 and 93 DF, p-value: 0.009967
anova(model)
## Analysis of Variance Table
##
## Response: yield
## Df Sum Sq Mean Sq F value Pr(>F)
## fertilizer 2 409.8 204.919 4.8446 0.009967 **
## Residuals 93 3933.8 42.299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(model)
## Anova Table (Type II tests)
##
## Response: yield
## Sum Sq Df F value Pr(>F)
## fertilizer 409.8 2 4.8446 0.009967 **
## Residuals 3933.8 93
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When deciding what statistics to report, you need to consider the
specific language of your alternative hypothesis: “there is at least one
difference among the different fertilizer treatments”. This asks whether
there is significant variation explained among the different levels of
the fertilizer explanatory variable. The output from the
summary() function compares each level of fertilizer to the
(Intercept), or reference level, of the categorical variable
(fertilizer1). This is not in line with the language of your hypothesis,
so these are not the right statistics to report. The output from the
anova() function tests whether the variance between group means
is significantly larger than the variability within each group, and
given the significant result, there is a significant difference among
the groups tested. This statistical result matches the language of the
hypothesis, and therefore should be reported! The null hypothesis that
there was no difference among the different fertilizer treatments was
rejected (F(2,93) = 4.8446, p-value = 0.009967). The F statistic is
reported along with the two degrees of freedom (numerator and
denominator from the ANOVA table) and the p-value.
Often additional comparisons are made after an analysis - these are called post-hoc comparisons. This step can assess the differences between the levels of the categorical variable(s). For each comparison made, p-value adjustments are applied to account for the increased false-positive rate that comes with multiple simultaneous tests. The most common example you may have heard of is the Tukey HSD. If you have planned comparisons before the data are collected, then you can use orthogonal contrasts to compare groups and properly adjust the p-values using Tukey HSD or another method.
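The adjustment idea itself can be illustrated with base R's p.adjust() function; a tiny sketch with made-up raw p-values (the Tukey adjustment used later in this section comes from emmeans itself rather than p.adjust()):

```r
# Toy illustration of multiple-testing adjustment (p-values are invented)
raw_p <- c(0.010, 0.020, 0.040)
p.adjust(raw_p, method = "bonferroni")  # multiplies each p by the number of tests
p.adjust(raw_p, method = "holm")        # stepwise, less conservative alternative
```

Note how each adjusted p-value is larger than the raw value: the more simultaneous tests you run, the stronger the evidence each one must provide.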
To produce these post-hoc comparisons, the emmeans()
function from the emmeans package and the
cld() function from the multcomp package are
used in tandem to identify which pairwise treatment groups are
significantly different from each other. This code can then be used in
combination with ggplot to produce an informative figure that includes
the raw data, mean estimates from the model, and evidence of
statistically significant differences.
library(emmeans)
## Welcome to emmeans.
## Caution: You lose important information if you filter this package's results.
## See '? untidy'
library(multcomp)
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
##
## geyser
library(stringr)
crop_means <- emmeans(model, ~ fertilizer)%>% # calculates the estimated marginal means
cld(Letters = letters, decreasing = TRUE)%>% # produces compact letter display
as_tibble() # changes crop_means to tibble data object to be read into ggplot
# Create a vector of labels for fertilizer labels - different from how the data are coded in crop
Fertilizer_Labels <- c("Control", "Fertilizer A", "Fertilizer B")
fertilizer_plot <- ggplot()+
geom_point(data = crop, aes(x = fertilizer, y = yield))+
geom_point(data = crop_means, aes(x = fertilizer, y = emmean), col = "red", size = 3, position = position_nudge(x = 0.1))+
geom_errorbar(data = crop_means, aes(x = fertilizer, ymin = emmean - SE, ymax = emmean + SE), width = 0.1, col = "red", position = position_nudge(x = 0.1))+
geom_text(data = crop_means, aes(x = fertilizer, y = emmean, label=str_trim(.group)), position = position_nudge(x=0.2))+ # adds the labels from the cld function to indicate which groups are significantly different
labs(x = "Fertilizer", y = "Mean Yield (kg) \u00B1(SE)")+ # "\u00B1" is code that produces the plus/minus symbol
scale_x_discrete(labels = Fertilizer_Labels)+
theme_classic()
We have included a few additional pieces of code to produce a more informative figure. The estimated marginal means are the same as the arithmetic means calculated above, but the standard errors are different since they are calculated from the pooled variance from the model rather than the individual group variances. The labels for the x and y axes have also been modified to convey more information. Lastly, the letters from the compact letter display function help to interpret which levels are significantly different from each other.
When we look at the output before the tibble object is made, we see the following message: “If two or more means share the same grouping symbol, then we cannot show them to be different. But we also did not show them to be the same.” These figures will often have language in a caption that will look something like this: “Groups with different letters indicate significant differences (alpha = 0.05).” The alpha level tells the reader that the threshold for significance is set at 0.05 - a common norm. The cld output also tells us that the p-value adjustment was made using the “tukey method for comparing a family of 3 estimates.”
This information should be reported in the language that describes the statistical analysis. Example language would look like this: “To test the null hypothesis that there are no differences in yield among levels of fertilizer, a linear model was fit with the independent variable of fertilizer and dependent variable of yield. Results from the ANOVA table and the F statistic were used to test the hypothesis. Additional post-hoc comparisons using the emmeans function from the emmeans package (Lenth 2025) and the cld function from the multcomp package (Hothorn et al. 2008), with the Tukey HSD post-hoc p-value adjustment, show which levels are significantly different from each other. Data analysis and visualization were completed using R version 4.4.2 ‘Pile of Leaves’ (R Core Team 2024).”
There is room for your own voice in writing these sections of your manuscripts, but pay attention to the crucial details that ensure your audience understands what was done and that your analysis could be replicated if needed. Also, make sure to include the citations for the R software and packages you use. There is a lot of effort that goes into the development and maintenance of these tools, and that effort deserves recognition.
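R can generate these citations for you; a quick sketch (citation() for a package only works if that package is installed):

```r
# Retrieve the citations to include in your methods section
citation()            # how to cite R itself
# citation("emmeans") # how to cite an installed package, e.g. emmeans
R.version.string      # the exact version string to report in your methods
```

Pasting the output of citation() into your reference manager is an easy way to keep the software citations accurate and up to date.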
When you have two categorical x variables, the model is known as a two-way ANOVA. There are some additional considerations to make with this analysis.
These data come from a classic data set used in many introductory data analysis seminars. These data show the body mass in grams among different penguin species on different islands and between sexes of penguins. There are other variables recorded that could be analyzed, but we will be focusing on the species, sex, and body mass (g) of the penguins for now.
Note: These data aren’t related to entomology, but it isn’t a big stretch to apply the comparisons in body mass between species and sexes of an organism. This translation of the type of data from an example data set to your own data is a valuable skill, and by getting better at recognizing similarities between the example data set and your own, you will improve your ability to use other resources to help with your own data analysis.
library(readr)
Penguins <- read_csv(here("data", "Two_Way_ANOVA.csv"), col_types = "ccddddcc")
str(Penguins)
## spc_tbl_ [344 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ species : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
## $ island : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : num [1:344] 3750 3800 3250 NA 3450 ...
## $ sex : chr [1:344] "male" "female" "female" NA ...
## $ year : chr [1:344] "2007" "2007" "2007" "2007" ...
## - attr(*, "spec")=
## .. cols(
## .. species = col_character(),
## .. island = col_character(),
## .. bill_length_mm = col_double(),
## .. bill_depth_mm = col_double(),
## .. flipper_length_mm = col_double(),
## .. body_mass_g = col_double(),
## .. sex = col_character(),
## .. year = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
# Remove NAs within the sex variable
Penguins <- subset(Penguins, !is.na(sex))
We confirmed the coding of each variable is correct - species and sex are categorical and body mass (g) is continuous.
As an initial step toward understanding the data and writing our question(s) and hypotheses, we will visualize the data first.
Penguin_Means <- Penguins %>%
group_by(species,sex)%>%
summarize(mean_mass = mean(body_mass_g),
std.dev = sd(body_mass_g),
count = n(),
std.error = sd(body_mass_g)/sqrt(n()))
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
ggplot(Penguins, aes(x = species, y = body_mass_g, col = sex, group = sex))+
geom_point(aes(group = sex), position = position_dodge(width = 0.5), alpha = 0.4)+
geom_point(data = Penguin_Means, aes(x = species, y = mean_mass, col = sex, group = sex), size = 4, position = position_dodge(width = 0.5))+
labs(x = "Species", y = "Body Mass (g)", color = "Sex")+
scale_color_brewer(labels = c("Female","Male"), palette = "Dark2")+
theme_classic()
colorBlindness::cvdPlot()
We used a few more data visualization tools to make this figure. The position argument in the geom_point() functions dodges the points by the grouping variable (sex). We used the function scale_color_brewer() to change the labels and color palette for the legend. You can read the help page for this function to learn more about what it can do - it is a great tool for choosing color palettes that consider how people with different abilities see color.
After reviewing the plot we can see the different types of data and how that leads into the types of questions that can be asked with the data.
Given the following three questions, write the null and alternative hypotheses.
Question 1: is there a difference in body mass in grams among the different species of penguins?
Null hypothesis: Alternative hypothesis:
Question 2: is there a difference in body mass in grams among the different sexes of penguins?
Null hypothesis: Alternative hypothesis:
Question 3: is the difference in body mass in grams between species the same for each sex of penguins?
Null hypothesis: Alternative hypothesis:
Including multiple terms in the model quickly complicates the analyses you can run and the hypotheses you can test.
Mod_Penguin <- lm(body_mass_g ~ species * sex, data = Penguins)
After fitting the model the assumptions should be reviewed prior to interpreting the results.
par(mfrow = c(2,2)) # creates a 2x2 arrangement of plots
plot(Mod_Penguin)
library(performance)
check_model(Mod_Penguin, detrend = F)
These results look good - normally distributed residuals,
homoscedasticity, and a linear trend in the residuals. We can now move on to
interpreting the results from the model. The summary()
function produces the parameter estimates, but the interpretation of
these estimates doesn’t perfectly align with the above hypotheses - we
will come back to this. The anova() and
Anova() functions from the stats and
car packages produce output from the ANOVA table, but there
are some (at times) critical differences between these two functions.
Specifically, when you have unbalanced replication (missing values) or
an interaction term in your model, you should use the
Anova() function to produce the type III sums of squares.
anova(Mod_Penguin)
## Analysis of Variance Table
##
## Response: body_mass_g
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 145190219 72595110 758.358 < 2.2e-16 ***
## sex 1 37090262 37090262 387.460 < 2.2e-16 ***
## species:sex 2 1676557 838278 8.757 0.0001973 ***
## Residuals 327 31302628 95727
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(Mod_Penguin)
## Anova Table (Type II tests)
##
## Response: body_mass_g
## Sum Sq Df F value Pr(>F)
## species 143401584 2 749.016 < 2.2e-16 ***
## sex 37090262 1 387.460 < 2.2e-16 ***
## species:sex 1676557 2 8.757 0.0001973 ***
## Residuals 31302628 327
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(Mod_Penguin, type = 2)
## Anova Table (Type II tests)
##
## Response: body_mass_g
## Sum Sq Df F value Pr(>F)
## species 143401584 2 749.016 < 2.2e-16 ***
## sex 37090262 1 387.460 < 2.2e-16 ***
## species:sex 1676557 2 8.757 0.0001973 ***
## Residuals 31302628 327
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(Mod_Penguin, type = 3)
## Anova Table (Type III tests)
##
## Response: body_mass_g
## Sum Sq Df F value Pr(>F)
## (Intercept) 828480899 1 8654.649 < 2.2e-16 ***
## species 60350016 2 315.220 < 2.2e-16 ***
## sex 16613442 1 173.551 < 2.2e-16 ***
## species:sex 1676557 2 8.757 0.0001973 ***
## Residuals 31302628 327
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(Mod_Penguin)
##
## Call:
## lm(formula = body_mass_g ~ species * sex, data = Penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -827.21 -213.97 11.03 206.51 861.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3368.84 36.21 93.030 < 2e-16 ***
## speciesChinstrap 158.37 64.24 2.465 0.01420 *
## speciesGentoo 1310.91 54.42 24.088 < 2e-16 ***
## sexmale 674.66 51.21 13.174 < 2e-16 ***
## speciesChinstrap:sexmale -262.89 90.85 -2.894 0.00406 **
## speciesGentoo:sexmale 130.44 76.44 1.706 0.08886 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 309.4 on 327 degrees of freedom
## Multiple R-squared: 0.8546, Adjusted R-squared: 0.8524
## F-statistic: 384.3 on 5 and 327 DF, p-value: < 2.2e-16
The results from the ANOVA table tell us that the inclusion of the interaction term of species and sex explains a significant amount of variation. It doesn’t tell us which levels are significantly different.
The results from the table of coefficients produced by the
summary() function show an estimate for the
(Intercept), two species, one sex, and then two species:sex
combinations. These estimates follow the default treatment coding that R
does behind the scenes when you include multiple categorical variables
in a model. The interpretation of these estimates is that the
(Intercept) refers to the reference level of each
categorical variable (species: Adelie, sex: female). Every other estimate
is the change in the response compared to this reference level. For
example: the estimated body mass for the average female Adelie penguin
is 3,368.84 (g), and the estimated body mass for the average female
Gentoo penguin is 4,679.75 (3368.84 + 1310.91). When you
consider the interaction effect, the estimated body mass for the average
male Chinstrap penguin is 3,938.98
(3368.84 + 158.37 + 674.66 - 262.89).
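A quick way to check this arithmetic is to add the relevant coefficients in code. The values are hard-coded here from the summary() output above so the sketch runs on its own; with the fitted model you could use coef(Mod_Penguin) instead:

```r
# Coefficients as printed by summary(Mod_Penguin), rounded to two decimals
b <- c("(Intercept)"              = 3368.84,
       "speciesChinstrap"         = 158.37,
       "speciesGentoo"            = 1310.91,
       "sexmale"                  = 674.66,
       "speciesChinstrap:sexmale" = -262.89,
       "speciesGentoo:sexmale"    = 130.44)
b[["(Intercept)"]]                               # female Adelie (reference level)
b[["(Intercept)"]] + b[["speciesGentoo"]]        # female Gentoo: 4679.75
sum(b[c("(Intercept)", "speciesChinstrap",       # male Chinstrap: intercept +
        "sexmale", "speciesChinstrap:sexmale")]) # species + sex + interaction
```

The last sum matches the Chinstrap male estimated marginal mean (3939, after rounding) reported by emmeans() below.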
To look further into the individual levels of each variable, we can use the estimated marginal means and post-hoc comparison tests we used before, with some minor modifications.
After we confirmed that the interaction effect of the model was
significant, we can use the emmeans() function to produce
the estimated marginal means. The rejection of the null hypothesis that body mass
is the same across the species and sexes suggests that there is a
difference among the levels of one variable across the other
variable.
When you have a significant interaction effect you shouldn’t analyze
the main effects independently since their interpretation is dependent
on the other effect. This means that we want to include both terms from
the interaction term in the emmeans() function. The
emmeans() function produces the estimated marginal mean,
SE, df, lower and upper confidence interval for each species and sex.
The cld() function used in conjunction with the
emmeans() function will produce a compact letter display of
all pair-wise comparisons. You can change the details argument from its
default of FALSE to TRUE to also show each pairwise comparison, and include the
decreasing argument if you want the largest estimated marginal mean to
be listed first.
When you have an interaction effect and want to test pair-wise comparisons, there are some additional coding options. The first set of comparisons shown includes all possible comparisons, with every species and sex combination tested (15 total comparisons). The other option shows nested comparisons, where species are compared within each level of sex.
Note: The selection of these comparisons should be driven by your hypotheses and what makes theoretical sense. These decisions should ideally be made during or soon after the experimental design stage. You don’t want the presence or absence of a significant effect here to inform your decisions.
Mod_Penguin%>%
emmeans(~species : sex)
## species sex emmean SE df lower.CL upper.CL
## Adelie female 3369 36.2 327 3298 3440
## Chinstrap female 3527 53.1 327 3423 3632
## Gentoo female 4680 40.6 327 4600 4760
## Adelie male 4043 36.2 327 3972 4115
## Chinstrap male 3939 53.1 327 3835 4043
## Gentoo male 5485 39.6 327 5407 5563
##
## Confidence level used: 0.95
Mod_Penguin%>%
emmeans(~species : sex)%>%
cld(Letters = letters, details = TRUE)
## $emmeans
## species sex emmean SE df lower.CL upper.CL .group
## Adelie female 3369 36.2 327 3298 3440 a
## Chinstrap female 3527 53.1 327 3423 3632 a
## Chinstrap male 3939 53.1 327 3835 4043 b
## Adelie male 4043 36.2 327 3972 4115 b
## Gentoo female 4680 40.6 327 4600 4760 c
## Gentoo male 5485 39.6 327 5407 5563 d
##
## Confidence level used: 0.95
## P value adjustment: tukey method for comparing a family of 6 estimates
## significance level used: alpha = 0.05
## NOTE: If two or more means share the same grouping symbol,
## then we cannot show them to be different.
## But we also did not show them to be the same.
##
## $comparisons
## contrast estimate SE df t.ratio p.value
## Chinstrap female - Adelie female 158 64.2 327 2.465 0.1376
## Chinstrap male - Adelie female 570 64.2 327 8.875 <.0001
## Chinstrap male - Chinstrap female 412 75.0 327 5.487 <.0001
## Adelie male - Adelie female 675 51.2 327 13.174 <.0001
## Adelie male - Chinstrap female 516 64.2 327 8.037 <.0001
## Adelie male - Chinstrap male 105 64.2 327 1.627 0.5812
## Gentoo female - Adelie female 1311 54.4 327 24.088 <.0001
## Gentoo female - Chinstrap female 1153 66.8 327 17.246 <.0001
## Gentoo female - Chinstrap male 741 66.8 327 11.085 <.0001
## Gentoo female - Adelie male 636 54.4 327 11.691 <.0001
## Gentoo male - Adelie female 2116 53.7 327 39.425 <.0001
## Gentoo male - Chinstrap female 1958 66.2 327 29.564 <.0001
## Gentoo male - Chinstrap male 1546 66.2 327 23.345 <.0001
## Gentoo male - Adelie male 1441 53.7 327 26.855 <.0001
## Gentoo male - Gentoo female 805 56.7 327 14.188 <.0001
##
## P value adjustment: tukey method for comparing a family of 6 estimates
All_means_Penguins <- Mod_Penguin %>%
emmeans(~species : sex) %>%
cld(Letters = letters, decreasing = TRUE) %>%
as_tibble()
Within_means_Penguins <- Mod_Penguin %>%
emmeans(~species | sex) %>%
cld(Letters = letters, decreasing = TRUE) %>%
as_tibble()
After you have chosen which method best reflects what matches your
hypothesis and/or theory, you can modify the cld object into a tibble,
which is much easier to use with ggplot() and make an
informative figure.
library(ggpp) # allows for the additional position option position_dodgenudge
## Registered S3 methods overwritten by 'ggpp':
## method from
## heightDetails.titleGrob ggplot2
## widthDetails.titleGrob ggplot2
##
## Attaching package: 'ggpp'
## The following object is masked from 'package:ggplot2':
##
## annotate
Plot_All_Means_Penguins<-ggplot() +
geom_point(data = Penguins, aes(y = body_mass_g, x = species, color = sex), position = position_dodgenudge(width = 0.9, x = 0)) +
geom_boxplot(data = Penguins, aes(y = body_mass_g, x = species, color = sex), width = 0.25, position = position_dodgenudge(width = 0.5, x = 0)) +
geom_point(data = All_means_Penguins, aes(y = emmean, x = species, group = sex), position = position_dodgenudge(width = 0.5, x = 0), size = 2,color = "red") +
geom_errorbar(data = All_means_Penguins, aes(ymin = lower.CL, ymax = upper.CL, x = species, group = sex), position = position_dodgenudge(width = 0.5, x = 0), color = "red", width = 0.1) +
geom_text(data = All_means_Penguins, aes(y = emmean, x = species, group = sex, label = str_trim(.group)), color = "black", position = position_dodgenudge(width = 1.2, x = -0.025), hjust = 0) +
labs(y = "Body Mass (g)", x = "Species", color = "Sex") +
scale_color_brewer(labels = c("Female", "Male"), palette = "Dark2") +
theme_classic()
colorBlindness::cvdPlot()
## Warning: `position_dodge()` requires non-overlapping x intervals.
Plot_Within_Means_Penguins<-ggplot() +
facet_wrap(~sex, labeller = label_both) +
geom_point(data = Penguins, aes(y = body_mass_g, x = species, color = species)) +
geom_boxplot(data = Penguins, aes(y = body_mass_g, x = species, color = species), width = 0.25, position = position_nudge(x = 0.2)) +
geom_point(data = Within_means_Penguins, aes(y = emmean, x = species), position = position_nudge(x = 0.2), size = 2,color = "red") +
geom_errorbar(data = Within_means_Penguins, aes(ymin = lower.CL, ymax = upper.CL, x = species), position = position_nudge(x = 0.2), color = "red", width = 0.1) +
geom_text(data = Within_means_Penguins, aes(y = emmean, x = species, label = str_trim(.group)), color = "black", position = position_nudge(x = 0.4), hjust = 0) +
labs(y = "Body Mass (g)", x = "Species", color = "Species") +
scale_color_brewer(palette = "Dark2") +
theme_classic()
# Use the ggsave and here functions to save these plots in the "outputs" folder
# The shell of these functions are provided, but you need to include the appropriate arguments in the right sequence
#ggsave(here(),)
#ggsave(here(),)
Both outputs are incorporated into their respective figures. The first figure shows all pair-wise comparisons, and the second shows the comparisons among species within each sex. The position = position_dodgenudge argument is useful in the first figure to both dodge and nudge the elements of the figure. The facet_wrap() function is useful in the second figure to produce a panel for each level of another factor variable - this function is very powerful for creating complex arrangements of figures and is worth researching further if you are interested (?facet_wrap).
These figures show a lot of information: raw data, box plots (interquartile range and median), and estimated marginal means with their associated pair-wise comparisons based on t-tests with corrected p-values. You will often see language like the following included in the caption of such figures to describe the interpretation of the different letters: “Different letters indicate significant differences at P <= 0.05 as determined by multiple comparison tests using the Tukey HSD method for adjusting the experiment-wise error rate.”
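The Tukey adjustment reported above is computed internally by emmeans. As a hedged base-R illustration of the same family-wise idea, the p.adjust() function applies alternative corrections such as Bonferroni (each p-value multiplied by the number of comparisons) or Holm:

```r
# Base-R sketch of family-wise p-value adjustment (Tukey itself is handled
# by emmeans; these are alternative corrections available in base R).
p_raw <- c(0.010, 0.020, 0.030)
p.adjust(p_raw, method = "bonferroni")  # each p multiplied by 3: 0.03 0.06 0.09
p.adjust(p_raw, method = "holm")        # step-down version, never more conservative
```

The Holm method is uniformly at least as powerful as Bonferroni while still controlling the family-wise error rate, which is why it is the default in p.adjust().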
When you have a continuous x variable, then you are still working with a linear model, but the interpretation of the results is a little different.
When you have one or more independent variables in the linear model that are all continuous that is known as linear regression. Simple linear regression has one independent variable, and multiple linear regression has multiple independent variables.
Data from (Kniss et al. 2012) show the sugar yield of sugar beet in response to volunteer corn density. The response variable is sucrose production in pounds per acre (LbsSucA), and the independent variable is volunteer corn density in plants per foot of row.
library(readr)
Beets <- read_csv(here("data", "Simple_Linear_Regression.csv"), col_types = "cdd")
str(Beets)
## spc_tbl_ [24 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Loc : chr [1:24] "WY09" "WY09" "WY09" "WY09" ...
## $ Density: num [1:24] 0 0.03 0.05 0.08 0.11 0.15 0.08 0.11 0 0.05 ...
## $ LbsSucA: num [1:24] 10106 8639 7752 5718 7954 ...
## - attr(*, "spec")=
## .. cols(
## .. Loc = col_character(),
## .. Density = col_double(),
## .. LbsSucA = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
summary(Beets)
## Loc Density LbsSucA
## Length:24 Min. :0.000 Min. : 3495
## Class :character 1st Qu.:0.030 1st Qu.: 5931
## Mode :character Median :0.065 Median : 7638
## Mean :0.070 Mean : 7232
## 3rd Qu.:0.110 3rd Qu.: 8297
## Max. :0.150 Max. :10106
The data are coded correctly - both Density and LbsSucA are numerical (continuous) variables, and the Loc (location) variable has no variation - we will ignore it. Since there is no categorical variable of interest, the summary() output gives us the descriptive statistics. A figure to visualize the data is still a helpful tool.
ggplot(Beets, aes(x = Density, y = LbsSucA))+
geom_point()+
ylim(0,NA)+
theme_classic()
# Does it make sense to report yield in units of pounds/acre?
# Might be a meaningful unit for the field.
# Can change to other unit
Beets <- Beets%>%
mutate(KgSucA = LbsSucA*0.453592)
ggplot(Beets, aes(x = Density, y = KgSucA))+
geom_point()+
ylim(0,NA)+
labs(y = "Kg Sucrose/Acre")+
theme_classic()
From here we can formulate the question and hypothesis: does volunteer corn density affect the yield of sugar beet?
Null hypothesis: there is no effect of density on sugar beet yield
Alternative hypothesis: there is an effect of density on sugar beet yield
Now we will fit the model, test assumptions, and discuss what statistics should be reported to address this set of hypotheses.
model_beets <- lm(LbsSucA ~ Density, data = Beets)
library(performance)
check_model(model_beets, detrend = F)
anova(model_beets)
## Analysis of Variance Table
##
## Response: LbsSucA
## Df Sum Sq Mean Sq F value Pr(>F)
## Density 1 25572276 25572276 13.374 0.001387 **
## Residuals 22 42066424 1912110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model_beets)
##
## Call:
## lm(formula = LbsSucA ~ Density, data = Beets)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2086.02 -1084.01 23.92 726.23 3007.73
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8677.6 485.6 17.869 1.38e-14 ***
## Density -20644.7 5645.2 -3.657 0.00139 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1383 on 22 degrees of freedom
## Multiple R-squared: 0.3781, Adjusted R-squared: 0.3498
## F-statistic: 13.37 on 1 and 22 DF, p-value: 0.001387
# Write the analogous code to fit and assess a linear model for Kg Sucrose per Acre as the response (y) variable and Density as the explanatory (x) variable.
# Changing the response variable will affect the estimates, standard errors, sums of squares, and mean squares, but not the test statistics
Regardless of which unit you choose to report, the ANOVA table tells you that Density has a significant effect on sugar beet yield. “The null hypothesis that there is no effect of density on sugar beet yield was refuted (F(1,22) = 13.374, p-value = 0.001387).” There is more information in the summary output in the interpretation of the estimate. The interpretation of these estimates usually follows this language: “for a one unit change in Density, there is a resultant decrease of 9,364.3 Kg of sugar beet yield (t = 3.657, df = 22, p-value = 0.00139).” The sign of the t statistic tells you the direction of the relationship, so you only need to report the absolute value. The t statistic uses the residual degrees of freedom (22 here), while a continuous variable contributes only one degree of freedom to the F test. Then, don’t forget to include the p-value.
Note that the observed range of the Density variable (0 to 0.15) does not span a one-unit change, and a decrease of over 9,000 Kg is over twice the maximum yield observed. We can rescale the statement by dividing every value by ten so that it falls within the observed range: “For a 0.1 unit change in Density, there is a resultant decrease of 936.43 Kg of sugar beet yield…”
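Rather than dividing by hand, you can refit the model with a rescaled predictor and read the per-0.1-unit slope directly. A minimal sketch using R's built-in cars data (an assumption - the Beets data are not re-read here); multiplying the predictor by 10 inside I() makes the reported slope a per-0.1-unit effect of the original variable:

```r
# Rescaling the predictor rescales the slope by the same factor;
# test statistics and p-values are unchanged.
m_raw    <- lm(dist ~ speed, data = cars)
m_scaled <- lm(dist ~ I(speed * 10), data = cars)

coef(m_raw)["speed"]  # slope per 1-unit change in speed
coef(m_scaled)[2]     # slope per 0.1-unit change in speed (1 unit of speed*10)
```

The two slopes differ by exactly a factor of 10, while the t statistic for the slope is identical in both fits.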
Now we can create a figure that incorporates the results from the
model with the visualization of the data. When you have a continuous
variable the predict() function from the stats
package can be used to predict values from the model.
# Create new data - sequence of Density values from min to max
NewData <- expand.grid(Density = seq(min(Beets$Density),max(Beets$Density), length.out = 1000))
# Create predicted values of yield from model using new data
yhat <- as.data.frame(predict(model_beets, newdata = NewData, se.fit = T))
# Combine the two data sets
NewData <- cbind(NewData,yhat)
# Another way to rename the fit variable
NewData$KgSucA <- NewData$fit
head(NewData)
## Density fit se.fit df residual.scale KgSucA
## 1 0.0000000000 8677.605 485.6201 22 1382.791 8677.605
## 2 0.0001501502 8674.505 484.9306 22 1382.791 8674.505
## 3 0.0003003003 8671.406 484.2417 22 1382.791 8671.406
## 4 0.0004504505 8668.306 483.5532 22 1382.791 8668.306
## 5 0.0006006006 8665.206 482.8652 22 1382.791 8665.206
## 6 0.0007507508 8662.106 482.1777 22 1382.791 8662.106
Beet_plot <- ggplot()+
geom_point(data = Beets, aes(x = Density, y = KgSucA))+
geom_line(data = NewData, aes(x=Density, y = KgSucA), col = "blue")+
geom_ribbon(data = NewData, aes(x=Density, ymin = KgSucA - se.fit, ymax = KgSucA + se.fit), fill = "blue", alpha = 0.3)+
theme_classic()
ggsave(here("outputs", "Beet_plot.tiff"), Beet_plot, width = 20, height = 15, units = "cm", dpi = 300)
Lastly, we can save the image produced in the plot tab using the
ggsave() function in combination with the
here() function. The image file will be saved in the
“outputs” folder and named “Beet_plot.tiff” with a width of 20 cm,
height of 15 cm, and resolution of 300 dpi. These settings can be
changed to fit sizing and resolution preferences or limitations for
presentations and/or manuscripts.
We mentioned the possibility of having multiple independent variables in a linear model - multiple linear regression. When you have a mixture of continuous and categorical variables in the same linear model, that is known as ANCOVA (analysis of covariance). We will go over some of the specifics of how to formulate, test, and visualize these hypotheses.
Ancova <- read_csv(here("data", "ANCOVA.csv"))
## Rows: 84 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): loc
## dbl (2): density, yield
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(Ancova)
## # A tibble: 6 × 3
## density yield loc
## <dbl> <dbl> <chr>
## 1 23.5 223. P
## 2 26.2 234. P
## 3 27.8 222. P
## 4 32.9 222. P
## 5 33.3 197. P
## 6 36.8 190. P
Ancova <- Ancova %>%
mutate(loc = as.factor(loc))
We confirmed that each variable is the correct type - density and yield are continuous, and location is categorical. We added additional code to change location from a character variable to a factor variable. This will be necessary for the final analysis and figures we make. The main differences between a character and a factor variable are that certain functions work only with factor variables, not character variables, and that with a factor variable you are able to set the order of the levels. This is valuable if you want to change which level is the reference level - this was discussed in the two-way ANOVA example.
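A minimal sketch with made-up data showing how the factor's first level acts as the reference level in lm(), and how relevel() changes it:

```r
# Factor levels default to alphabetical order, so "High" is the reference here.
d <- data.frame(grp = factor(c("Low", "Low", "High", "High")),
                y   = c(1, 3, 10, 14))
levels(d$grp)                  # "High" "Low"
m1 <- lm(y ~ grp, data = d)
coef(m1)["(Intercept)"]        # mean of the reference level "High": 12

# relevel() (or factor(..., levels = ...)) switches the reference level.
d$grp <- relevel(d$grp, ref = "Low")
m2 <- lm(y ~ grp, data = d)
coef(m2)["(Intercept)"]        # now the mean of "Low": 2
```

The model fit is identical either way; only the parameterization (which group the intercept and contrasts refer to) changes.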
The dataset contains the results from a onion trial that tested the
effect of planting density (plants per square meter) at two locations
(Purnong Landing or Virginia). There is no information regarding the
spatial arrangement of the fields other than the two locations, so there
is no opportunity or need to use the desplot()
function.
library(ggtext)
ggplot(Ancova, aes(x = density, y = yield, group = loc, col = loc)) +
geom_point() +
labs(x = expression(paste("Planting Density/m"^"2")), y = "Yield", col = "Location") +
theme_classic()
We included another useful modification to ggplot labels using the expression() function. This allows you to use mathematical notation in text arguments. In this example it let us use a superscript in the x-axis label. The specific syntax places quotation marks around each text piece, with the exponent symbol “^” turning the piece that follows it into a superscript.
From the graph made above we see a negative relationship between planting density and yield. There are also two different locations, and the negative relationship seems to be shared across both. We can test both the negative relationship and whether that relationship differs between locations. Furthermore, we notice the shape of the relationship might not be exactly straight. We can fit a polynomial term or possibly a logarithmic term for the effect of density to see if either fits better than a linear relationship.
Remember that testing whether the slopes of these two lines are the same requires an interaction term in the model. It is important to develop these hypotheses prior to analyzing the data so that the presence or absence of a significant result doesn’t influence the type of analysis you perform. Doing otherwise is akin to p-hacking, which is considered unethical and should be avoided.
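The interaction term can be seen directly in the design matrix. A small sketch with made-up data: the formula x * g expands to x + g + x:g, and the x:g column is what allows the slope to differ between groups.

```r
# Toy data: one continuous variable and one two-level factor.
d <- data.frame(x = c(1, 2, 3, 1, 2, 3),
                g = factor(c("A", "A", "A", "B", "B", "B")))

# With default treatment contrasts, the interaction model has four columns:
colnames(model.matrix(~ x * g, data = d))
# "(Intercept)" "x" "gB" "x:gB"
```

The "x:gB" column holds x for group B rows and 0 otherwise, so its coefficient is the difference in slope between the two groups - testing it against zero tests whether the slopes are equal.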
Model_linear <- lm(yield ~ density * loc, data = Ancova, contrasts = list(loc = contr.sum))
check_model(Model_linear, detrend = F)
Model_log <- lm(yield ~ log(density) * loc, data = Ancova, contrasts = list(loc = contr.sum))
check_model(Model_log, detrend = F)
Model_power <- lm(yield ~ poly(density, 2) * loc, data = Ancova, contrasts = list(loc = contr.sum))
check_model(Model_power, detrend = F)
Improvements in the linearity plot are noticeable when comparing both the log and power models to the linear model. Outliers and normality of residuals look fine for all models. Collinearity is going to be higher in the presence of an interaction term - take that check with a grain of salt. The homogeneity of variance plot is best for the power model, but not terrible for the log model.
summary(Model_linear)
##
## Call:
## lm(formula = yield ~ density * loc, data = Ancova, contrasts = list(loc = contr.sum))
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.054 -16.329 -7.433 14.720 110.485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 200.67148 5.83592 34.386 < 2e-16 ***
## density -1.10139 0.06952 -15.843 < 2e-16 ***
## loc1 18.98811 5.83592 3.254 0.00167 **
## density:loc1 -0.03545 0.06952 -0.510 0.61148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.03 on 80 degrees of freedom
## Multiple R-squared: 0.768, Adjusted R-squared: 0.7593
## F-statistic: 88.27 on 3 and 80 DF, p-value: < 2.2e-16
summary(Model_log)
##
## Call:
## lm(formula = yield ~ log(density) * loc, data = Ancova, contrasts = list(loc = contr.sum))
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.361 -8.532 -0.343 7.869 69.960
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 485.991 13.620 35.683 < 2e-16 ***
## log(density) -88.379 3.256 -27.142 < 2e-16 ***
## loc1 40.477 13.620 2.972 0.00391 **
## log(density):loc1 -5.413 3.256 -1.662 0.10037
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.6 on 80 degrees of freedom
## Multiple R-squared: 0.9056, Adjusted R-squared: 0.9021
## F-statistic: 255.9 on 3 and 80 DF, p-value: < 2.2e-16
summary(Model_power)
##
## Call:
## lm(formula = yield ~ poly(density, 2) * loc, data = Ancova, contrasts = list(loc = contr.sum))
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.719 -9.717 0.843 8.556 80.295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120.184 2.007 59.877 < 2e-16 ***
## poly(density, 2)1 -417.069 18.497 -22.548 < 2e-16 ***
## poly(density, 2)2 170.780 18.368 9.297 2.86e-14 ***
## loc1 17.912 2.007 8.924 1.52e-13 ***
## poly(density, 2)1:loc1 -35.418 18.497 -1.915 0.0592 .
## poly(density, 2)2:loc1 -6.094 18.368 -0.332 0.7410
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.15 on 78 degrees of freedom
## Multiple R-squared: 0.89, Adjusted R-squared: 0.8829
## F-statistic: 126.2 on 5 and 78 DF, p-value: < 2.2e-16
car::Anova(Model_linear, type = 3)
## Anova Table (Type III tests)
##
## Response: yield
## Sum Sq Df F value Pr(>F)
## (Intercept) 801033 1 1182.3683 < 2e-16 ***
## density 170050 1 251.0029 < 2e-16 ***
## loc 7172 1 10.5863 0.00167 **
## density:loc 176 1 0.2601 0.61148
## Residuals 54199 80
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(Model_log, type = 3)
## Anova Table (Type III tests)
##
## Response: yield
## Sum Sq Df F value Pr(>F)
## (Intercept) 350924 1 1273.2489 < 2.2e-16 ***
## log(density) 203046 1 736.7065 < 2.2e-16 ***
## loc 2434 1 8.8323 0.003908 **
## log(density):loc 762 1 2.7632 0.100370
## Residuals 22049 80
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(Model_power, type = 3)
## Anova Table (Type III tests)
##
## Response: yield
## Sum Sq Df F value Pr(>F)
## (Intercept) 1181410 1 3585.2269 < 2.2e-16 ***
## poly(density, 2) 195395 2 296.4824 < 2.2e-16 ***
## loc 26242 1 79.6372 1.516e-13 ***
## poly(density, 2):loc 1246 2 1.8912 0.1578
## Residuals 25703 78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(Model_linear, type = 2)
## Anova Table (Type II tests)
##
## Response: yield
## Sum Sq Df F value Pr(>F)
## density 170721 1 251.9934 < 2.2e-16 ***
## loc 22145 1 32.6872 1.796e-07 ***
## density:loc 176 1 0.2601 0.6115
## Residuals 54199 80
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(Model_log, type = 2)
## Anova Table (Type II tests)
##
## Response: yield
## Sum Sq Df F value Pr(>F)
## log(density) 202285 1 733.9450 < 2.2e-16 ***
## loc 26643 1 96.6693 2.055e-15 ***
## log(density):loc 762 1 2.7632 0.1004
## Residuals 22049 80
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
car::Anova(Model_power, type = 2)
## Anova Table (Type II tests)
##
## Response: yield
## Sum Sq Df F value Pr(>F)
## poly(density, 2) 198147 2 300.6580 < 2.2e-16 ***
## loc 26223 1 79.5794 1.538e-13 ***
## poly(density, 2):loc 1246 2 1.8912 0.1578
## Residuals 25703 78
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All models fail to reject the null hypothesis that the relationship between density and yield is the same at the two locations (i.e., no interaction). Given the lack of a significant interaction, the type II sums of squares provide a more powerful test. There is strong evidence for differences between the two locations and for a negative relationship between density and yield. We will now discuss how to choose which model to use for reporting final model results.
There are multiple options for determining which model should be used to report estimates for the slopes and intercepts - that is, which model has the best fit. Comparing models fit to the same data but with different terms using the anova() function (lower case a) will compare the residual sums of squares and produce an F statistic if the models have different degrees of freedom - here, comparing the power model to either the linear or log model. Another function from the performance package, compare_performance(), lets you compare several model fit statistics across a list of models. For linear models fit with the lm() function, the adjusted R-squared, root mean squared error (RMSE), and sigma are most commonly used for model selection. Criteria like AIC, AICc, and BIC are used with other types of models, like generalized linear models (GLMs), that are estimated using a likelihood function. We will address these models in the next section.
anova(Model_linear, Model_power)
## Analysis of Variance Table
##
## Model 1: yield ~ density * loc
## Model 2: yield ~ poly(density, 2) * loc
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 54199
## 2 78 25703 2 28496 43.238 2.31e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
compare_performance(Model_linear, Model_log, Model_power)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2
## -----------------------------------------------------------------------------
## Model_linear | lm | 791.8 (<.001) | 792.6 (<.001) | 804.0 (<.001) | 0.768
## Model_log | lm | 716.3 (>.999) | 717.0 (>.999) | 728.4 (>.999) | 0.906
## Model_power | lm | 733.2 (<.001) | 734.6 (<.001) | 750.2 (<.001) | 0.890
##
## Name | R2 (adj.) | RMSE | Sigma
## ------------------------------------------
## Model_linear | 0.759 | 25.401 | 26.028
## Model_log | 0.902 | 16.202 | 16.602
## Model_power | 0.883 | 17.492 | 18.153
Based on the interpretation of the model fit statistics, the log model fits the data better. We will produce figures that use estimates from each model to demonstrate visually why and how the log model fits best.
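As a check on what the adjusted R-squared reports, here is a minimal base-R sketch (using the built-in cars data, an assumption) computing it from the ordinary R-squared, the sample size n, and the number of predictors p:

```r
# Adjusted R-squared penalizes added predictors:
#   R2_adj = 1 - (1 - R2) * (n - 1) / (n - p - 1)
m <- lm(dist ~ speed, data = cars)
n <- nrow(cars)
p <- 1                                   # one predictor (speed)
r2     <- summary(m)$r.squared
r2_adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
c(r2 = r2, r2_adj = r2_adj)              # adjusted value is slightly smaller
```

Because the penalty grows with p, adjusted R-squared can decrease when an added term does not pull its weight, which is what makes it useful for comparing models with different numbers of terms.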
When you have multiple variables in your model and you want to make predictions from the model, you need to make sure that each term is included in the expand.grid() function. Here we will create a sequence of density values from the minimum to the maximum for both levels of location (remember that categorical variables need to be factors to work with the levels() call inside the expand.grid() function).
# Create new data - sequence of Density values from min to max and over both levels of location variable
NewData <- expand.grid(density = seq(min(Ancova$density), max(Ancova$density), length.out = 1000), loc = levels(Ancova$loc))
# Create predicted values of yield with standard errors from model using new data
yhat <- as.data.frame(predict(Model_log, newdata = NewData, se.fit = T))
# Combine the two data sets
NewData <- cbind(NewData,yhat)
# Another way to rename the fit variable
NewData$yield <- NewData$fit
head(NewData)
## density loc fit se.fit df residual.scale yield
## 1 18.78000 P 251.3956 6.667895 80 16.60161 251.3956
## 2 18.94614 P 250.5695 6.629313 80 16.60161 250.5695
## 3 19.11227 P 249.7506 6.591106 80 16.60161 249.7506
## 4 19.27841 P 248.9388 6.553269 80 16.60161 248.9388
## 5 19.44454 P 248.1340 6.515796 80 16.60161 248.1340
## 6 19.61068 P 247.3361 6.478680 80 16.60161 247.3361
Ancova_Log_plot <- ggplot()+
geom_point(data = Ancova, aes(x = density, y = yield, col = loc))+
geom_line(data = NewData, aes(x=density, y = yield, col = loc))+
geom_ribbon(data = NewData, aes(x=density, ymin = yield - se.fit, ymax = yield + se.fit, fill = loc), alpha = 0.3)+
labs(x = expression(paste("Planting Density/m"^"2")), y = "Yield") +
guides(fill = guide_legend(title="Location"), color = guide_legend(title = "Location"))+
scale_color_brewer(labels = c("P", "V"), palette = "Dark2") +
scale_fill_brewer(labels = c("P", "V"), palette = "Dark2") +
theme_classic()
ggsave(here("outputs", "Ancova_log_plot.tiff"), Ancova_Log_plot, width = 20, height = 15, units = "cm", dpi = 300)
Another way to modify the legend title is with the guides() function. Since both the color and fill aesthetics are used among the different geometries layered in the plot, there need to be arguments for both in the guides() function. In this instance they are the same, but that may not always be the case. This flexibility is useful as figures become increasingly complex.
Now we will produce the output for the linear and power models to compare among the three models.
# Repeat the above process for the linear and power models
# Create the NewData object, predictions, combine the two sets of data, and then a plot that display the data.
# This code is one way to combine multiple plots into one figure
library(patchwork)
##
## Attaching package: 'patchwork'
## The following object is masked from 'package:MASS':
##
## area
# Ancova_Comparison_plot <- Ancova_Linear_plot/Ancova_Log_plot/Ancova_Power_plot
# ggsave(here("outputs", "Ancova_Comparison_plot.tiff"), Ancova_Comparison_plot, width = 20, height = 40, units = "cm", dpi = 300)
The patchwork library allows you to create composite plots using mathematical operators. The division symbol here stacks each plot on top of the next, while the addition symbol would place the next plot beside it. There are other packages that use different approaches to combining multiple plots, like the gridExtra and cowplot packages. You can review the specific functions and arguments for these packages in their help files.
The interpretation of the results from the Anova() and summary() output for the log model is as follows: We fail to reject the null hypothesis that the effect of log(density) on yield is equal across locations (F(1, 80) = 2.7632, p-value = 0.1004). We reject the null hypothesis that the average yield is equal across locations (F(1, 80) = 96.6693, p-value < 0.001). We reject the null hypothesis that there is no effect of density on yield (t = 27.142, df = 80, p-value < 0.001). We can use the emmeans() function to estimate a value of yield for particular combinations of density and location parameters. These combinations can be chosen based on values of interest related to the experiment or sub-field.
emmeans::emmeans(Model_log,
specs = ~ density * loc,
at=list(density=c(75,125), loc = c("P","V")), # Specify values
type='response')
## density loc emmean SE df lower.CL upper.CL
## 75 P 121.5 2.60 80 116.4 126.7
## 125 P 73.6 3.83 80 66.0 81.2
## 75 V 87.3 2.83 80 81.7 92.9
## 125 V 44.9 4.32 80 36.3 53.5
##
## Confidence level used: 0.95
Now we will increase the complexity by including more than two independent (x) variables in the model and review how this impacts the steps in the analysis.
We are going to use the Penguins data again for this exercise. For now we will ignore the year variable - this will be covered as a part of the advanced modeling examples section.
Penguins <- read_csv(here("data", "Two_Way_ANOVA.csv"), col_types = "ffddddff") # We can change the col_types to factors (f) instead of characters (c)
str(Penguins)
## spc_tbl_ [344 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Gentoo",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Torgersen","Biscoe",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: num [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : num [1:344] 3750 3800 3250 NA 3450 ...
## $ sex : Factor w/ 2 levels "male","female": 1 2 2 NA 2 1 2 1 NA NA ...
## $ year : Factor w/ 3 levels "2007","2008",..: 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. species = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. island = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. bill_length_mm = col_double(),
## .. bill_depth_mm = col_double(),
## .. flipper_length_mm = col_double(),
## .. body_mass_g = col_double(),
## .. sex = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. year = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
## .. )
## - attr(*, "problems")=<externalptr>
# Remove NAs within the sex variable
Penguins <- subset(Penguins, !is.na(sex))
head(Penguins)
## # A tibble: 6 × 8
## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Adelie Torgersen 39.1 18.7 181 3750
## 2 Adelie Torgersen 39.5 17.4 186 3800
## 3 Adelie Torgersen 40.3 18 195 3250
## 4 Adelie Torgersen 36.7 19.3 193 3450
## 5 Adelie Torgersen 39.3 20.6 190 3650
## 6 Adelie Torgersen 38.9 17.8 181 3625
## # ℹ 2 more variables: sex <fct>, year <fct>
The data are loaded correctly and of the appropriate type.
We can look at each of the variables individually and in pairs to determine which variables may impact the response (y) variable - body mass (g).
Penguin_Means_species <- Penguins %>%
group_by(species)%>%
summarize(mean_mass = mean(body_mass_g),
std.dev = sd(body_mass_g),
count = n(),
std.error = sd(body_mass_g)/sqrt(n()))
Penguin_Means_sex <- Penguins %>%
group_by(sex)%>%
summarize(mean_mass = mean(body_mass_g),
std.dev = sd(body_mass_g),
count = n(),
std.error = sd(body_mass_g)/sqrt(n()))
Penguin_Means_island <- Penguins %>%
group_by(island)%>%
summarize(mean_mass = mean(body_mass_g),
std.dev = sd(body_mass_g),
count = n(),
std.error = sd(body_mass_g)/sqrt(n()))
Penguin_Means_species_sex <- Penguins %>%
group_by(species,sex)%>%
summarize(mean_mass = mean(body_mass_g),
std.dev = sd(body_mass_g),
count = n(),
std.error = sd(body_mass_g)/sqrt(n()))
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
Penguin_Means_species_island <- Penguins %>%
group_by(species,island)%>%
summarize(mean_mass = mean(body_mass_g),
std.dev = sd(body_mass_g),
count = n(),
std.error = sd(body_mass_g)/sqrt(n()))
## `summarise()` has grouped output by 'species'. You can override using the
## `.groups` argument.
Penguin_Means_island_sex <- Penguins %>%
group_by(island,sex)%>%
summarize(mean_mass = mean(body_mass_g),
std.dev = sd(body_mass_g),
count = n(),
std.error = sd(body_mass_g)/sqrt(n()))
## `summarise()` has grouped output by 'island'. You can override using the
## `.groups` argument.
Species_Plot <- ggplot(Penguins, aes(x = species, y = body_mass_g, col = species))+
geom_point(alpha = 0.4)+
geom_point(data = Penguin_Means_species, aes(x = species, y = mean_mass), size = 4)+
labs(x = "Species", y = "Body Mass (g)")+
scale_color_brewer(palette = "Dark2")+
theme_classic()
Sex_Plot <- ggplot(Penguins, aes(x = sex, y = body_mass_g, col = sex))+
geom_point(alpha = 0.4)+
geom_point(data = Penguin_Means_sex, aes(x = sex, y = mean_mass), size = 4)+
labs(x = "Sex", y = "Body Mass (g)")+
scale_color_brewer(palette = "Dark2")+
theme_classic()
Island_Plot <- ggplot(Penguins, aes(x = island, y = body_mass_g, col = island))+
geom_point(alpha = 0.4)+
geom_point(data = Penguin_Means_island, aes(x = island, y = mean_mass), size = 4)+
labs(x = "Island", y = "Body Mass (g)")+
scale_color_brewer(palette = "Dark2")+
theme_classic()
Species_Sex_Plot <- ggplot(Penguins, aes(x = species, y = body_mass_g, col = sex, group = sex))+
geom_point(aes(group = sex), position = position_dodge(width = 0.5), alpha = 0.4)+
geom_point(data = Penguin_Means_species_sex, aes(x = species, y = mean_mass, col = sex, group = sex), size = 4, position = position_dodge(width = 0.5))+
labs(x = "Species", y = "Body Mass (g)", color = "Sex")+
scale_color_brewer(labels = c("Female", "Male"), palette = "Dark2")+
theme_classic()
Species_Island_Plot <- ggplot(Penguins, aes(x = species, y = body_mass_g, col = island, group = island))+
geom_point(aes(group = island), position = position_dodge(width = 0.5), alpha = 0.4)+
geom_point(data = Penguin_Means_species_island, aes(x = species, y = mean_mass, col = island, group = island), size = 4, position = position_dodge(width = 0.5))+
labs(x = "Species", y = "Body Mass (g)", color = "Island")+
scale_color_brewer(labels = c("Biscoe", "Dream", "Torgersen"), palette = "Dark2")+
theme_classic()
Island_Sex_Plot <- ggplot(Penguins, aes(x = island, y = body_mass_g, col = sex, group = sex))+
geom_point(aes(group = sex), position = position_dodge(width = 0.5), alpha = 0.4)+
geom_point(data = Penguin_Means_island_sex, aes(x = island, y = mean_mass, col = sex, group = sex), size = 4, position = position_dodge(width = 0.5))+
labs(x = "Island", y = "Body Mass (g)", color = "Sex")+
scale_color_brewer(labels = c("Female", "Male"), palette = "Dark2")+
theme_classic()
Length_Sex_Plot <- ggplot(Penguins, aes(x = bill_length_mm, y = body_mass_g, col = sex, group = sex))+
geom_point(aes(x = bill_length_mm, y = body_mass_g, col = sex, group = sex))+
labs(x = "Bill Length (mm)", y = "Body Mass (g)", color = "Sex")+
scale_color_brewer(labels = c("Female", "Male"), palette = "Dark2")+
theme_classic()
Length_Species_Plot <- ggplot(Penguins, aes(x = bill_length_mm, y = body_mass_g, col = species, group = species))+
geom_point(aes(x = bill_length_mm, y = body_mass_g, col = species, group = species))+
labs(x = "Bill Length (mm)", y = "Body Mass (g)", color = "Species")+
scale_color_brewer(labels = c("Adelie", "Chinstrap", "Gentoo"), palette = "Dark2")+
theme_classic()
Length_Island_Plot <- ggplot(Penguins, aes(x = bill_length_mm, y = body_mass_g, col = island, group = island))+
geom_point(aes(x = bill_length_mm, y = body_mass_g, col = island, group = island))+
labs(x = "Bill Length (mm)", y = "Body Mass (g)", color = "Island")+
scale_color_brewer(labels = c("Torgersen", "Biscoe", "Dream"), palette = "Dark2")+
theme_classic()
Flipper_Sex_Plot <- ggplot(Penguins, aes(x = flipper_length_mm, y = body_mass_g, col = sex, group = sex))+
geom_point(aes(x = flipper_length_mm, y = body_mass_g, col = sex, group = sex))+
labs(x = "Flipper Length (mm)", y = "Body Mass (g)", color = "Sex")+
scale_color_brewer(labels = c("Female", "Male"), palette = "Dark2")+
theme_classic()
Flipper_Species_Plot <- ggplot(Penguins, aes(x = flipper_length_mm, y = body_mass_g, col = species, group = species))+
geom_point(aes(x = flipper_length_mm, y = body_mass_g, col = species, group = species))+
labs(x = "Flipper Length (mm)", y = "Body Mass (g)", color = "Species")+
scale_color_brewer(labels = c("Adelie", "Chinstrap", "Gentoo"), palette = "Dark2")+
theme_classic()
Flipper_Island_Plot <- ggplot(Penguins, aes(x = flipper_length_mm, y = body_mass_g, col = island, group = island))+
geom_point(aes(x = flipper_length_mm, y = body_mass_g, col = island, group = island))+
labs(x = "Flipper Length (mm)", y = "Body Mass (g)", color = "Island")+
scale_color_brewer(labels = c("Torgersen", "Biscoe", "Dream"), palette = "Dark2")+
theme_classic()
Depth_Sex_Plot <- ggplot(Penguins, aes(x = bill_depth_mm, y = body_mass_g, col = sex, group = sex))+
geom_point(aes(x = bill_depth_mm, y = body_mass_g, col = sex, group = sex))+
labs(x = "Bill Depth (mm)", y = "Body Mass (g)", color = "Sex")+
scale_color_brewer(labels = c("Female", "Male"), palette = "Dark2")+
theme_classic()
Depth_Species_Plot <- ggplot(Penguins, aes(x = bill_depth_mm, y = body_mass_g, col = species, group = species))+
geom_point(aes(x = bill_depth_mm, y = body_mass_g, col = species, group = species))+
labs(x = "Bill Depth (mm)", y = "Body Mass (g)", color = "Species")+
scale_color_brewer(labels = c("Adelie", "Chinstrap", "Gentoo"), palette = "Dark2")+
theme_classic()
Depth_Island_Plot <- ggplot(Penguins, aes(x = bill_depth_mm, y = body_mass_g, col = island, group = island))+
geom_point(aes(x = bill_depth_mm, y = body_mass_g, col = island, group = island))+
labs(x = "Bill Depth (mm)", y = "Body Mass (g)", color = "Island")+
scale_color_brewer(labels = c("Torgersen", "Biscoe", "Dream"), palette = "Dark2")+
theme_classic()
Factor_Plot <- Species_Plot/Sex_Plot/Island_Plot
Two_Way_Factor_Plot <- Species_Sex_Plot/Species_Island_Plot/Island_Sex_Plot
Length_Plot <- Length_Species_Plot/Length_Sex_Plot/Length_Island_Plot
Flipper_Plot <- Flipper_Species_Plot/Flipper_Sex_Plot/Flipper_Island_Plot
Depth_Plot <- Depth_Species_Plot/Depth_Sex_Plot/Depth_Island_Plot
This is a lot of code to investigate the many combinations of the different independent (x) variables. By taking the time to look through these figures we get a better understanding of the possible patterns in the data. This process is especially helpful if we don't have direct theoretical guidance from the sub-field to test specific hypotheses.
Looking at the categorical variables individually
Species: Gentoo penguins may have larger body mass; it is hard to tell between Adelie and Chinstrap.
Sex: males may have larger body mass.
Island: Biscoe penguins may have larger body mass; it is hard to tell between Torgersen and Dream.
Looking at the interaction between categorical variables
Species * Sex: Does the difference in body mass between sexes look different for each species?
Species * Island: There isn't full replication of each species on each island - we can't test the full factorial.
Island * Sex: Does the difference in body mass between sexes look different for each island?
Looking at the interaction between each continuous variable and a categorical variable
Bill Length * Species: Adelie and Chinstrap seem to share the same slope and intercept; Gentoo has a lower intercept and possibly a different slope.
Bill Length * Sex: male and female penguins seem to have the same slope and intercept.
Bill Length * Island: Dream and Torgersen seem to share the same slope and intercept; Biscoe seems to have a larger slope and possibly a different intercept.
Flipper Length * Species: all species seem to have the same slope and intercept - they just cover different ranges of flipper length.
Flipper Length * Sex: the same holds for the sexes, but the two clusters we see are related to the species clustering - keep this in mind.
Flipper Length * Island: the same holds for the islands - the ranges of flipper length for species and island are related, as we saw in the species-by-island graph earlier.
Bill Depth * Species: all species seem to have the same slope, with a different intercept for the Chinstrap penguins.
Bill Depth * Sex: this plot is hard to interpret given its inter-relatedness with the species plot.
Bill Depth * Island: the patterns in this figure seem to be driven by differences between species - remember there isn't a full factorial species:island combination.
There is evidence for a possible three-way interaction, where the differences in body mass depend on the continuous variables and on both species and sex. Three-way interactions are notoriously difficult to interpret, so we will walk through this analysis carefully.
After reviewing all these graphs we can fit a model that tests the three-way interactions of bill depth, flipper length, and bill length, each with species and sex, on body mass. We will also include an interaction effect of island and sex on body mass (we can't test the three-way interaction of island, sex, and species, because each species is not represented on each island). There don't seem to be any curvilinear relationships with any of the continuous variables to consider. This could be argued to be the "full model" given the logical steps we just went through. We can then review candidate models with fewer parameters to determine which best fits the data.
Model_Full <- lm(body_mass_g ~ bill_depth_mm * species * sex + flipper_length_mm * species * sex + bill_length_mm * species * sex + island * sex, data = Penguins)
check_model(Model_Full, detrend = FALSE)
Mod <- lm(body_mass_g ~ bill_length_mm + flipper_length_mm + bill_depth_mm + species + sex + island, data = Penguins)
summary(Mod)
##
## Call:
## lm(formula = body_mass_g ~ bill_length_mm + flipper_length_mm +
## bill_depth_mm + species + sex + island, data = Penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -779.20 -167.35 -3.16 179.37 914.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1160.868 613.037 -1.894 0.059164 .
## bill_length_mm 18.189 7.136 2.549 0.011270 *
## flipper_length_mm 16.239 2.939 5.524 6.80e-08 ***
## bill_depth_mm 67.575 19.821 3.409 0.000734 ***
## speciesGentoo 987.761 137.238 7.197 4.30e-12 ***
## speciesChinstrap -260.306 88.551 -2.940 0.003522 **
## sexfemale -387.224 48.138 -8.044 1.66e-14 ***
## islandBiscoe 48.064 60.922 0.789 0.430722
## islandDream 34.961 57.569 0.607 0.544087
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 287.9 on 324 degrees of freedom
## Multiple R-squared: 0.8752, Adjusted R-squared: 0.8721
## F-statistic: 284.1 on 8 and 324 DF, p-value: < 2.2e-16
car::Anova(Mod, type = 3)
## Anova Table (Type III tests)
##
## Response: body_mass_g
## Sum Sq Df F value Pr(>F)
## (Intercept) 297265 1 3.5858 0.0591643 .
## bill_length_mm 538552 1 6.4964 0.0112698 *
## flipper_length_mm 2529936 1 30.5181 6.801e-08 ***
## bill_depth_mm 963531 1 11.6229 0.0007338 ***
## species 7733709 2 46.6451 < 2.2e-16 ***
## sex 5364097 1 64.7060 1.655e-14 ***
## island 56214 2 0.3391 0.7126979
## Residuals 26859432 324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
check_model(Mod, detrend = FALSE)
The assessment of the residuals looks pretty good. The normality of residuals falls along the q-q line. The linearity and homogeneity-of-variance lines are flat and horizontal. Collinearity is again higher in the presence of interaction effects - not a concern. There are no observations identified as outliers or influential observations. The clustering we see in the Linearity and Homogeneity of Variance plots is likely due to the observations being made in different years - we are ignoring this structure in the data for now, but it is valuable to notice how un-modeled structure can appear in these residual plots.
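The way un-modeled grouping shows up in residuals can be made concrete with a toy example. This is a hedged sketch using the built-in mtcars data (not the penguin data): a model for mpg that omits the cylinder grouping leaves residuals whose group means differ noticeably by cylinder count.

```r
# Fit a model that deliberately ignores the cyl grouping
m <- lm(mpg ~ wt, data = mtcars)
# Mean residual within each cylinder group - if the grouping were fully
# captured by the model, these would all be close to zero
tapply(resid(m), mtcars$cyl, mean)
```

Clustering like this in a residual plot is a hint that there is structure in the data worth modeling, even if we choose to set it aside for now.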
summary(Model_Full)
##
## Call:
## lm(formula = body_mass_g ~ bill_depth_mm * species * sex + flipper_length_mm *
## species * sex + bill_length_mm * species * sex + island *
## sex, data = Penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -717.90 -178.89 -0.44 172.17 850.71
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -1036.211 1174.886 -0.882
## bill_depth_mm 44.650 33.629 1.328
## speciesGentoo 2475.154 1822.435 1.358
## speciesChinstrap -4447.067 2390.993 -1.860
## sexfemale -142.455 1796.119 -0.079
## flipper_length_mm 17.526 5.544 3.161
## bill_length_mm 19.444 15.400 1.263
## islandBiscoe 109.924 86.845 1.266
## islandDream 97.730 81.522 1.199
## bill_depth_mm:speciesGentoo -1.309 64.564 -0.020
## bill_depth_mm:speciesChinstrap -32.862 84.261 -0.390
## bill_depth_mm:sexfemale 59.926 48.845 1.227
## speciesGentoo:sexfemale -4941.191 3068.806 -1.610
## speciesChinstrap:sexfemale 4700.141 3321.649 1.415
## speciesGentoo:flipper_length_mm -10.344 9.753 -1.061
## speciesChinstrap:flipper_length_mm 20.881 10.516 1.986
## sexfemale:flipper_length_mm -5.927 8.096 -0.732
## speciesGentoo:bill_length_mm 14.183 21.856 0.649
## speciesChinstrap:bill_length_mm 8.341 37.864 0.220
## sexfemale:bill_length_mm -4.660 22.613 -0.206
## sexfemale:islandBiscoe -136.844 119.913 -1.141
## sexfemale:islandDream -141.676 113.460 -1.249
## bill_depth_mm:speciesGentoo:sexfemale 3.229 107.258 0.030
## bill_depth_mm:speciesChinstrap:sexfemale 46.441 111.876 0.415
## speciesGentoo:sexfemale:flipper_length_mm 27.926 15.145 1.844
## speciesChinstrap:sexfemale:flipper_length_mm -23.679 14.769 -1.603
## speciesGentoo:sexfemale:bill_length_mm -15.184 33.924 -0.448
## speciesChinstrap:sexfemale:bill_length_mm -7.424 44.382 -0.167
## Pr(>|t|)
## (Intercept) 0.37849
## bill_depth_mm 0.18527
## speciesGentoo 0.17542
## speciesChinstrap 0.06386 .
## sexfemale 0.93684
## flipper_length_mm 0.00173 **
## bill_length_mm 0.20770
## islandBiscoe 0.20657
## islandDream 0.23153
## bill_depth_mm:speciesGentoo 0.98384
## bill_depth_mm:speciesChinstrap 0.69681
## bill_depth_mm:sexfemale 0.22082
## speciesGentoo:sexfemale 0.10840
## speciesChinstrap:sexfemale 0.15809
## speciesGentoo:flipper_length_mm 0.28974
## speciesChinstrap:flipper_length_mm 0.04796 *
## sexfemale:flipper_length_mm 0.46470
## speciesGentoo:bill_length_mm 0.51688
## speciesChinstrap:bill_length_mm 0.82580
## sexfemale:bill_length_mm 0.83686
## sexfemale:islandBiscoe 0.25468
## sexfemale:islandDream 0.21274
## bill_depth_mm:speciesGentoo:sexfemale 0.97600
## bill_depth_mm:speciesChinstrap:sexfemale 0.67836
## speciesGentoo:sexfemale:flipper_length_mm 0.06616 .
## speciesChinstrap:sexfemale:flipper_length_mm 0.10990
## speciesGentoo:sexfemale:bill_length_mm 0.65477
## speciesChinstrap:sexfemale:bill_length_mm 0.86727
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 278.1 on 305 degrees of freedom
## Multiple R-squared: 0.8904, Adjusted R-squared: 0.8807
## F-statistic: 91.76 on 27 and 305 DF, p-value: < 2.2e-16
car::Anova(Model_Full, type = 3)
## Anova Table (Type III tests)
##
## Response: body_mass_g
## Sum Sq Df F value Pr(>F)
## (Intercept) 60179 1 0.7779 0.37849
## bill_depth_mm 136378 1 1.7628 0.18527
## species 583495 2 3.7711 0.02411 *
## sex 487 1 0.0063 0.93684
## flipper_length_mm 773036 1 9.9922 0.00173 **
## bill_length_mm 123328 1 1.5941 0.20770
## island 152903 2 0.9882 0.37343
## bill_depth_mm:species 12066 2 0.0780 0.92500
## bill_depth_mm:sex 116448 1 1.5052 0.22082
## species:sex 513805 2 3.3207 0.03744 *
## species:flipper_length_mm 538036 2 3.4773 0.03212 *
## sex:flipper_length_mm 41460 1 0.5359 0.46470
## species:bill_length_mm 32677 2 0.2112 0.80974
## sex:bill_length_mm 3286 1 0.0425 0.83686
## sex:island 145889 2 0.9429 0.39064
## bill_depth_mm:species:sex 13552 2 0.0876 0.91616
## species:sex:flipper_length_mm 651962 2 4.2136 0.01566 *
## species:sex:bill_length_mm 15519 2 0.1003 0.90460
## Residuals 23596035 305
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are many different approaches one can take after fitting a complex multiple linear regression model like this. Various model-selection processes, such as forward and backward selection, are discussed in the literature, each with pros and cons. Dropping non-significant covariates and/or terms deemed secondary to the main focus of your hypotheses is also discussed. At the other end of the philosophical spectrum is retaining every term and reporting the results as is. Like many contested opinions, there are nuances to this decision that may influence your choice. I am not advocating one approach over another, only that you should be aware of the implications of each.
Model-selection methods have the potential for misuse and can lead to sub-optimal models. On the other hand, leaving every term in the model uses up degrees of freedom, which can be problematic when the sample size is small and the residual degrees of freedom are already limited. Whichever approach you choose, be transparent in reporting the steps you took, and consult your sub-field's literature for examples of how to report them.
For this example we will spend some time considering subsets of models that fit within our understanding of the patterns within the data and compare model fit statistics to select the best fitting model.
# There wasn't much difference between islands, and there was a lot of variance around their means
Model_No_Island <- lm(body_mass_g ~ bill_depth_mm * species * sex + flipper_length_mm * species * sex + bill_length_mm * species * sex, data = Penguins)
# Bill length seemed to have the same slope for each species, and the different intercept for each species is captured by the other terms in the model
Model_No_Length_int <- lm(body_mass_g ~ bill_depth_mm * species * sex + flipper_length_mm * species * sex + bill_length_mm + island, data = Penguins)
# The three-way interaction seems to be related to the flipper length variable
Model_Flipper_int <- lm(body_mass_g ~ flipper_length_mm * species * sex + bill_length_mm + bill_depth_mm + island, data = Penguins)
car::Anova(Model_Flipper_int, type = 3)
## Anova Table (Type III tests)
##
## Response: body_mass_g
## Sum Sq Df F value Pr(>F)
## (Intercept) 93647 1 1.2346 0.2673621
## flipper_length_mm 692716 1 9.1323 0.0027161 **
## species 823634 2 5.4291 0.0048046 **
## sex 442 1 0.0058 0.9392211
## bill_length_mm 754827 1 9.9511 0.0017616 **
## bill_depth_mm 954463 1 12.5829 0.0004481 ***
## island 47262 2 0.3115 0.7325489
## flipper_length_mm:species 546772 2 3.6041 0.0283323 *
## flipper_length_mm:sex 10897 1 0.1437 0.7049216
## species:sex 717001 2 4.7262 0.0094939 **
## flipper_length_mm:species:sex 659762 2 4.3489 0.0137007 *
## Residuals 24045664 317
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
compare_performance(Model_Full, Model_No_Island, Model_No_Length_int, Model_Flipper_int)
## # Comparison of Model Performance Indices
##
## Name | Model | AIC (weights) | AICc (weights) | BIC (weights)
## ------------------------------------------------------------------------------
## Model_Full | lm | 4722.1 (<.001) | 4727.8 (<.001) | 4832.5 (<.001)
## Model_No_Island | lm | 4716.6 (0.002) | 4720.8 (<.001) | 4811.8 (<.001)
## Model_No_Length_int | lm | 4711.3 (0.031) | 4714.5 (0.016) | 4795.0 (<.001)
## Model_Flipper_int | lm | 4704.4 (0.967) | 4706.3 (0.983) | 4769.1 (>.999)
##
## Name | R2 | R2 (adj.) | RMSE | Sigma
## -----------------------------------------------------------
## Model_Full | 0.890 | 0.881 | 266.193 | 278.144
## Model_No_Island | 0.890 | 0.881 | 267.191 | 277.373
## Model_No_Length_int | 0.889 | 0.882 | 267.457 | 276.312
## Model_Flipper_int | 0.888 | 0.883 | 268.718 | 275.416
anova(Model_Flipper_int,Model_Full)
## Analysis of Variance Table
##
## Model 1: body_mass_g ~ flipper_length_mm * species * sex + bill_length_mm +
## bill_depth_mm + island
## Model 2: body_mass_g ~ bill_depth_mm * species * sex + flipper_length_mm *
## species * sex + bill_length_mm * species * sex + island *
## sex
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 317 24045664
## 2 305 23596035 12 449628 0.4843 0.9234
The comparison of adjusted R^2 and the other model fit statistics indicates that the model with only the flipper length three-way interaction is the best fitting model. Some of the values are close, but this model is also the simplest (most parsimonious) of the candidates. You may have noticed that I kept the term for island in the model. By keeping a non-significant term in the model we are effectively stating that the covariate has a small but non-zero effect. If the term is removed, we are explicitly setting its effect to zero - when the results from the model haven't necessarily confirmed this. The model tells us the estimate could be zero, but it could also be any value within the standard error around zero.
We did effectively set the three-way interaction terms with bill length and bill depth to zero, but we retained the main effects of species and sex and their interactions in the model. This is a balanced, middle-ground approach that draws on both philosophies. Ultimately, you should be transparent and let the broader scientific community scrutinize your results.
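The "set to zero versus keep a small estimate" distinction can be made concrete with a toy example. This hedged sketch uses the built-in mtcars data (chosen for illustration, not the penguin model): dropping a covariate fixes its coefficient at exactly zero, while retaining it keeps an estimate with a confidence interval that may well span zero.

```r
# Same response, with and without the drat covariate
with_cov    <- lm(mpg ~ wt + qsec + drat, data = mtcars)
without_cov <- lm(mpg ~ wt + qsec, data = mtcars)
# Retaining drat keeps a (possibly near-zero) estimate plus its uncertainty:
confint(with_cov)["drat", ]
# Dropping drat is equivalent to setting its coefficient to exactly 0,
# so the two models' fitted values differ, if only slightly:
summary(abs(fitted(with_cov) - fitted(without_cov)))
```

Neither choice is automatically right; the point is that they encode different claims about the covariate's effect.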
This model has many more variables, and there are additional steps needed to visualize the complexity of these results. We will use the expand.grid() function again, but we need to include every term that is fit in the model. The interpretation of these results is the effect of flipper length, and its interaction with species and sex, for each island at the average bill length and bill depth. This is a common method of interpreting terms from a multiple linear regression model.
# Create new data - sequence of flipper length values from min to max, over levels of species, sex, and island variable, and at the mean of bill length and depth
# Mean values are chosen for the other continuous variables - the mean is a realistic value that could have been measured
# There are times when the median should be chosen instead - e.g., for ordinal or integer variables
NewData <- expand.grid(flipper_length_mm = seq(min(Penguins$flipper_length_mm), max(Penguins$flipper_length_mm), length.out = 1000),
                       species = levels(Penguins$species),
                       sex = levels(Penguins$sex),
                       island = levels(Penguins$island),
                       bill_length_mm = mean(Penguins$bill_length_mm),
                       bill_depth_mm = mean(Penguins$bill_depth_mm))
# Create predicted values of body mass with standard errors from the model using the new data
yhat <- as.data.frame(predict(Model_Flipper_int, newdata = NewData, se.fit = T))
# Combine the two data sets
NewData <- cbind(NewData,yhat)
# Another way to rename the fit variable
NewData$body_mass_g <- NewData$fit
head(NewData)
## flipper_length_mm species sex island bill_length_mm bill_depth_mm
## 1 172.0000 Adelie male Torgersen 43.99279 17.16486
## 2 172.0591 Adelie male Torgersen 43.99279 17.16486
## 3 172.1181 Adelie male Torgersen 43.99279 17.16486
## 4 172.1772 Adelie male Torgersen 43.99279 17.16486
## 5 172.2362 Adelie male Torgersen 43.99279 17.16486
## 6 172.2953 Adelie male Torgersen 43.99279 17.16486
## fit se.fit df residual.scale body_mass_g
## 1 3653.706 128.5845 317 275.4157 3653.706
## 2 3654.614 128.3274 317 275.4157 3654.614
## 3 3655.523 128.0704 317 275.4157 3655.523
## 4 3656.431 127.8136 317 275.4157 3656.431
## 5 3657.339 127.5570 317 275.4157 3657.339
## 6 3658.247 127.3006 317 275.4157 3658.247
Flipper_Three_Way_plot <- ggplot()+
facet_wrap(~sex)+
geom_point(data = Penguins, aes(x = flipper_length_mm, y = body_mass_g, col = species))+
geom_line(data = NewData, aes(x=flipper_length_mm, y = body_mass_g, col = species))+
geom_ribbon(data = NewData, aes(x=flipper_length_mm, ymin = body_mass_g - se.fit, ymax = body_mass_g + se.fit, fill = species), alpha = 0.3)+
labs(x = "Flipper Length (mm)", y = "Body Mass (g)") +
guides(fill = guide_legend(title="Species"), color = guide_legend(title = "Species"))+
scale_color_brewer(labels = c("Adelie", "Gentoo", "Chinstrap"), palette = "Dark2") +
scale_fill_brewer(labels = c("Adelie", "Gentoo", "Chinstrap"), palette = "Dark2") +
theme_classic()
ggsave(here("outputs", "Flipper_Three_Way_plot.tiff"), Flipper_Three_Way_plot, width = 20, height = 15, units = "cm", dpi = 300)
This faceted figure lets us see the individual trend lines for each species and sex of penguin, showing the relationship between flipper length and body mass. There is another function in the emmeans package, emtrends(), that allows us to test for differences among these slopes, similar to how the emmeans() function works.
Flipper_Trends <- Model_Flipper_int %>%
emtrends(~species | sex, var = "flipper_length_mm",cov.reduce = range) %>%
cld(Letters = letters, decreasing = TRUE) %>%
as_tibble()
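To demystify what emtrends() is estimating, the group-specific slopes can be recovered by hand from an interaction model's coefficients. This hedged sketch uses the built-in iris data rather than the penguins: with R's default treatment contrasts, each non-reference group's slope is the base slope plus the relevant interaction coefficient.

```r
# Interaction model: Petal.Length slope is allowed to vary by Species
mod <- lm(Sepal.Length ~ Petal.Length * Species, data = iris)
b <- coef(mod)
# Slope of Petal.Length for each Species under treatment contrasts:
# the reference level (setosa) gets the base slope, the others add
# their interaction term
slopes <- c(
  setosa     = unname(b["Petal.Length"]),
  versicolor = unname(b["Petal.Length"] + b["Petal.Length:Speciesversicolor"]),
  virginica  = unname(b["Petal.Length"] + b["Petal.Length:Speciesvirginica"])
)
round(slopes, 3)
```

emtrends() performs this same bookkeeping (plus standard errors and multiplicity-adjusted comparisons) for us, which is why it is the more convenient tool for the penguin model.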
Here is possible language you could use when reporting the results for the three-way interaction.
There was a significant three-way interaction effect in which the relationship between flipper length and body mass differed among penguin species depending on sex (F(2, 317) = 4.3489, p = 0.0137). The nested comparison of estimated marginal means of linear trends of flipper length among species within each sex was tested using the emtrends() function from the emmeans package (Lenth 2025). Comparing the slopes of these trend lines within each sex, the slope for male Gentoo penguins was smaller than that for male Chinstrap penguins; there were no differences in slopes among species for the female penguins.